Probabilistic Inference in Machine Learning. MSc in Data Science Methodology. Barcelona School of Economics.
Supervisor: Prof. Lorenzo Cappello, PhD
Setup: the analysis loads rstanarm 2.32.1, arm 1.14-4, rstan 2.32.7 (Stan 2.32.2), loo 2.8.0, and bayesplot 1.11.1, together with MASS, Matrix, lme4, tidyr, dplyr, and gridExtra. As recommended in the package startup messages, we use options(mc.cores = parallel::detectCores()) for execution on a local multicore CPU and rstan_options(auto_write = TRUE) to avoid recompiling unchanged Stan programs.
This study presents a hierarchical Bayesian analysis of radon levels using an enhanced Hamiltonian Monte Carlo approach. We compare two modeling strategies—a dense model and a hybrid model—by examining their posterior distributions and key parameter estimates. The results indicate that while most parameters, including the effect of floor level on radon concentration, remain consistent across models, the hybrid model exhibits slightly improved precision (LOO-ELPD = −1034.4 vs −1037.3). Notably, the hybrid model yields a lower estimate for the average log-radon level across counties, suggesting that the inclusion of latent variables shifts the intercept downward. These findings reinforce the benefits of hierarchical modeling in capturing regional variations in radon levels and highlight the potential for further improvements through dynamic or spatial modeling.
In this project, we perform a hierarchical Bayesian analysis of the radon dataset, which comprises 919 radon measurements from homes across 85 counties in Minnesota. Our objective is to understand how radon levels vary both within and between counties. Inspired by Gelman and Hill’s work on hierarchical models, we build on this foundation by implementing an enhanced Hamiltonian Monte Carlo approach in Stan. Given the natural hierarchical structure—measurements nested within counties—Bayesian hierarchical modeling offers a principled framework for accounting for both within-county and between-county variability and is thus expected to provide efficient and robust posterior inference for our multilevel model.
As mentioned, we begin with a baseline hierarchical model inspired by Gelman and Hill (2007), which uses a centered parameterization. We then systematically explore a series of algorithmic enhancements aimed at improving convergence, sampling efficiency, and posterior stability. These include:
Comparing centered vs. non-centered parameterizations;
Testing diagonal vs. dense mass matrix adaptations;
Implementing a warm-start strategy using variational inference;
Exploring reparameterizations (e.g., log-scale, partial non-centering);
Incorporating robust priors;
And finally, combining Hamiltonian Monte Carlo (HMC/NUTS) with Gibbs sampling in a hybrid MCMC routine.
Throughout, we assess convergence via diagnostics such as R-hat, Effective Sample Size (ESS), trace plots, and LOO-ELPD for predictive accuracy.
In this section, we explore the radon dataset to understand its structure and key variables.
First, we load the radon dataset. We can do this with the data() function, as the dataset is included in the rstanarm package. Alternatively, we added the file as a CSV in the Project.zip:
data("radon")
#radon<-read.csv("radon_exported.csv") if the above gives errors due to package incompatibility.
Let’s inspect the first few rows:
head(radon)
floor county log_radon log_uranium
1 1 AITKIN 0.83290912 -0.6890476
2 0 AITKIN 0.83290912 -0.6890476
3 0 AITKIN 1.09861229 -0.6890476
4 0 AITKIN 0.09531018 -0.6890476
5 0 ANOKA 1.16315081 -0.8473129
6 0 ANOKA 0.95551145 -0.8473129
The dataset contains the following variables:
\(floor\): Indicates whether the measurement was taken in a basement (coded as 0) or on an upper floor (coded as 1).
\(county\): Identifies the county for each observation.
\(log_radon\): The natural logarithm of the radon measurement. This transformation is applied to stabilize variance and approach normality in the distribution of radon levels.
\(log_uranium\): The natural logarithm of the uranium measurement. Uranium concentration in the soil is considered a potential predictor of radon levels, as radon is a decay product of uranium.
We also check the dimensions of the dataset:
dim(radon)
[1] 919 4
This dataset consists of 919 observations.
To understand the distribution of radon levels, we visualize the histogram of the log-transformed radon measurements:
ggplot(radon, aes(x = log_radon)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Log Radon Levels", x = "Log Radon", y = "Frequency")
The distribution of log-transformed radon levels is roughly unimodal,
centered around 1 to 2 on the log scale, and shows a moderate right
tail. The log transformation reduces skew compared to raw radon
measurements and helps stabilize variance, making it more amenable to
hierarchical modeling.
Next, we explore how radon levels vary by county. A boxplot of log_radon across counties provides a clear visualization of the differences:
ggplot(radon, aes(x = factor(county), y = log_radon)) +
geom_boxplot(fill = "lightgreen", color = "black") +
labs(title = "Radon Levels by County", x = "County", y = "Log Radon") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
There is clear variation in log radon levels across counties, with some
counties showing higher or lower median values than others. Within each
county, there is also notable spread in the data. This suggests both
between-county and within-county variability and reinforces the need for
a hierarchical modeling approach that can capture these different levels
of variation.
Additionally, we also compute and plot the mean log radon level for each county:
county_summary <- radon %>%
group_by(county) %>%
summarise(mean_log_radon = mean(log_radon, na.rm = TRUE),
n = n())
ggplot(county_summary, aes(x = reorder(county, mean_log_radon), y = mean_log_radon)) +
geom_point() +
coord_flip() +
scale_x_discrete(expand = expansion(mult = c(0.05, 0.05))) +
labs(title = "Mean Log Radon Levels by County",
x = "County",
y = "Mean Log Radon") +
theme_minimal() +
theme(axis.text.y = element_text(size = 4))
Again, this shows that average radon levels vary noticeably across
counties, with some counties having much lower mean log radon than
others. The clear gradient from left to right indicates substantial
between-county variation, which supports using a hierarchical model to
capture these differences alongside the within-county variability.
An important predictor in our dataset is the \(floor\) variable, which indicates if the measurement is from a basement (0) or an upper floor (1). We examine how radon levels differ based on this predictor:
ggplot(radon, aes(x = factor(floor), y = log_radon)) +
geom_boxplot(fill = "orange", color = "black") +
labs(title = "Radon Levels by Floor (Basement vs Upper Floor)",
x = "Floor (0 = Basement, 1 = Upper Floor)", y = "Log Radon")
We can see that homes measured in the basement tend to have higher radon
levels than those measured on an upper floor. Although there is overlap
between the two groups, the boxplot shows a higher median and upper
quartile for basement measurements, which aligns with the idea that
radon gas often enters through the ground and accumulates more in lower
levels of a building.
Note: Additionally, including log_uranium as a predictor in the model allows us to account for potential covariate effects that might explain some of the variability in radon levels.
The radon data are naturally grouped by county, meaning that multiple observations (houses) are nested within each county. This leads to two sources of variability:
Within-County Variability: Variation in radon levels among houses within the same county.
Between-County Variability: Variation in the average radon level from one county to another.
To further illustrate this structure, we summarize key statistics by county:
county_stats <- radon %>%
group_by(county) %>%
summarise(n = n(),
mean_log_radon = mean(log_radon, na.rm = TRUE),
sd_log_radon = sd(log_radon, na.rm = TRUE))
county_stats
# A tibble: 85 × 4
county n mean_log_radon sd_log_radon
<fct> <int> <dbl> <dbl>
1 AITKIN 4 0.715 0.432
2 ANOKA 52 0.891 0.718
3 BECKER 3 1.09 0.717
4 BELTRAMI 7 1.19 0.894
5 BENTON 4 1.28 0.415
6 BIGSTONE 3 1.54 0.504
7 BLUEEARTH 14 1.93 0.542
8 BROWN 4 1.65 0.595
9 CARLTON 10 0.977 0.585
10 CARVER 6 1.22 1.90
# ℹ 75 more rows
The table shows that each of the 85 counties has a different number of observations (n) and distinct mean and standard deviation of log radon levels. Some counties have very few data points (e.g., 3 or 4), while others have many more (e.g., 52). The mean log radon values also vary considerably across counties, as do their standard deviations. This variation in both sample size and radon levels by county underscores the need for a hierarchical model that can account for county-to-county differences (random effects) while borrowing strength across counties.
In this section, we build our baseline hierarchical model and outline our inference goals. We load our Stan model (saved as radon_model.stan), prepare the data for modeling, run the posterior sampler using Hamiltonian Monte Carlo (HMC) via the No-U-Turn Sampler (NUTS), and perform several diagnostic checks.
We aim to model the log-transformed radon level \(y_{ij}\) for house \(i\) in county \(j\). We include a fixed effect for whether the measurement was taken in the basement (0) or on an upper floor (1). In addition, each county has its own intercept, reflecting the idea that some counties may have generally higher or lower radon levels. Note that this model is inspired by the hierarchical models presented by Gelman and Hill, 2007 (p. 256f.). We can write this as:
\[ \begin{aligned} y_{ij} &\sim \mathcal{N}(\alpha_j + \beta \, x_{ij}, \sigma^2), \\[6pt] \alpha_j &\sim \mathcal{N}(\mu_\alpha, \tau^2), \end{aligned} \] where:
\(y_{ij}\) is the log radon measurement for house \(i\) in county \(j\).
\(x_{ij}\) is the floor variable (0 = basement, 1 = upper floor).
\(\alpha_j\) is the county-specific intercept, which follows a normal distribution centered at \(\mu_\alpha\) with standard deviation \(\tau\).
\(\beta\) is the fixed effect for the floor predictor.
\(\sigma\) is the within-county (house-level) residual standard deviation.
\(\mu_\alpha\) is the overall (global) intercept across all counties.
\(\tau\) captures the between-county variability.
Moreover, Gelman and Hill (2007) advocate for using weakly informative priors in multilevel models. They recommend broad normal priors with large variances for location parameters (pp. 420–422) and half-Cauchy priors for scale parameters to regularize the model without imposing overly strict constraints (pp. 427–433, 500). Following these recommendations, we specify the following priors: \[ \begin{alignedat}{2} \mu_\alpha &\sim \mathcal{N}(0, 10^2),\\[6pt] \tau &\sim \text{Cauchy}^+(0, 2.5),\\[6pt] \beta &\sim \mathcal{N}(0, 10^2),\\[6pt] \sigma &\sim \text{Cauchy}^+(0, 2.5). \end{alignedat} \] These choices provide sufficient flexibility for the data to inform the posterior while ensuring adequate regularization of our hierarchical model.
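To get a feel for how weakly informative the half-Cauchy(0, 2.5) scale prior is, we can compute a few of its quantiles in R using the symmetry of the Cauchy distribution (a quick illustrative check, not part of the original analysis):

```r
# P(X < x) for a half-Cauchy(0, s) equals 2 * pcauchy(x, 0, s) - 1, so the
# p-th quantile of the half-Cauchy is qcauchy((1 + p) / 2, 0, s).
half_cauchy_q <- function(p, scale = 2.5) {
  qcauchy((1 + p) / 2, location = 0, scale = scale)
}
half_cauchy_q(c(0.5, 0.9, 0.99))  # approximately 2.5, 15.8, 159.1
```

Half the prior mass lies below 2.5, yet the 99th percentile sits near 160: the prior gently regularizes \(\tau\) and \(\sigma\) toward moderate values while leaving ample room for the data to dominate.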
We denote this model as our baseline model. A plate diagram for the model is provided in the file baseline_model.jpeg.
The goal of this Bayesian hierarchical model is to estimate the posterior distributions of all unknown parameters:
\(\beta\) (effect of basement vs. upper floor),
\(\mu_\alpha\) (overall average intercept),
\(\tau\) (between-county variability),
\(\sigma\) (within-county variability), and
each county-specific intercept \(\alpha_j\).
By estimating these posterior distributions, we can quantify both the uncertainty at the county level (how counties differ from one another) and the uncertainty at the house level (residual variability around each county’s intercept).
Note: We built this baseline model formulation in the file radon_model_centered.stan.
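The file itself is not reproduced in the text. A sketch consistent with the likelihood and priors above could look as follows; this is our reconstruction, not necessarily identical to the original file, and the generated quantities block producing log_lik is our assumption, motivated by the later use of loo() on the fitted model:

```stan
data {
  int<lower=1> N;                         // number of house measurements
  int<lower=1> J;                         // number of counties
  array[N] int<lower=1, upper=J> county;  // county index for each house
  vector[N] y;                            // log radon measurements
  vector[N] x;                            // floor indicator (0 = basement)
}
parameters {
  vector[J] alpha;         // county-specific intercepts (centered)
  real beta;               // floor effect
  real mu_alpha;           // global intercept
  real<lower=0> tau;       // between-county sd
  real<lower=0> sigma;     // within-county sd
}
model {
  mu_alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  tau ~ cauchy(0, 2.5);    // half-Cauchy via the lower bound
  sigma ~ cauchy(0, 2.5);
  alpha ~ normal(mu_alpha, tau);
  y ~ normal(alpha[county] + beta * x, sigma);
}
generated quantities {
  // pointwise log-likelihood, needed for LOO-based model comparison
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = normal_lpdf(y[n] | alpha[county[n]] + beta * x[n], sigma);
}
```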
We first prepare the data in a list that matches the format expected by the Stan model. Note that we convert the county variable to an integer to serve as an index for the hierarchical structure.
radon_data <- list(
N = nrow(radon),
J = length(unique(radon$county)),
county = as.integer(radon$county),
y = radon$log_radon,
x = radon$floor
)
We run our model using the No-U-Turn Sampler (NUTS), an advanced variant of Hamiltonian Monte Carlo (HMC) and the default algorithm in Stan. NUTS leverages gradient information from the posterior to efficiently navigate high-dimensional parameter spaces. It adaptively determines the optimal path length (i.e., the number of leapfrog steps), thereby avoiding unnecessary computation and manual tuning. This automatic adaptation helps ensure that the sampler explores the posterior distribution thoroughly, leading to better convergence and improved effective sample sizes. In our analysis, we begin with a baseline setup and refine the sampler in the subsequent parts.
We now run our Stan model using 4 chains with 2000 iterations each (1000 warmup iterations and 1000 sampling iterations):
fit <- stan(file = "radon_model_centered.stan",
data = radon_data,
iter = 2000,
warmup = 1000,
chains = 4,
seed = 123)
Summary of the posterior distributions for the key parameters:
print(fit, pars = c("beta", "mu_alpha", "tau", "sigma"), probs = c(0.025, 0.5, 0.975))
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
beta -0.66 0 0.07 -0.79 -0.66 -0.53 4247 1
mu_alpha 1.49 0 0.05 1.39 1.49 1.59 3114 1
tau 0.32 0 0.05 0.24 0.32 0.42 1365 1
sigma 0.73 0 0.02 0.69 0.73 0.76 6424 1
Samples were drawn using NUTS(diag_e) at Tue Mar 25 13:15:12 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Parameter Estimates:
\(\beta\): The fixed effect for the floor variable has a mean of -0.66, with a 95% credible interval from -0.79 to -0.53. This negative value suggests that, all else being equal, measurements taken on an upper floor (coded as 1) are associated with lower log radon levels than basement measurements.
\(\mu_\alpha\): The overall (global) intercept is estimated at 1.49, with a credible interval from 1.39 to 1.59. This represents the average log radon level for the baseline (reference) group.
\(\tau\): The between-county standard deviation is estimated at 0.32 (credible interval: 0.24 to 0.42), quantifying the variability in county-specific intercepts.
\(\sigma\): The residual standard deviation (within-county variability) is estimated at 0.73 (credible interval: 0.69 to 0.76).
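Taken together, \(\tau\) and \(\sigma\) determine how much of the total variation in log radon is attributable to differences between counties. A standard summary is the intraclass correlation, computed here from the posterior means: \[ \rho = \frac{\tau^2}{\tau^2 + \sigma^2} \approx \frac{0.32^2}{0.32^2 + 0.73^2} \approx 0.16, \] so roughly 16% of the variance lies between counties and the remaining 84% within them.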
In this subsection, we assess the convergence of our Markov chains using a suite of diagnostic tools we have learned in class, as well as some additional ones available in the bayesplot package. After sampling, it is crucial to verify that the chains have stabilized in the same region of parameter space, exhibit minimal autocorrelation, and yield sufficiently large effective sample sizes. These checks ensure that our posterior samples are reliable and that the sampler has thoroughly explored the target distribution.
Trace plots provide a visual representation of the sampling paths across chains. Well-mixed chains should appear as overlapping, horizontal “fuzzy caterpillars.”
We convert the fitted model to an array for bayesplot and investigate the traces:
posterior_samples <- as.array(fit)
mcmc_trace(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
From these trace plots, we see that all four chains appear to have
converged and are mixing well. Each chain fluctuates around a similar
central value for each parameter, and we observe no pronounced upward or
downward trends. The overlap among chains is substantial, suggesting
that we have reached a stable sampling distribution. We also see no
abrupt jumps or signs of poor mixing, which reinforces that our NUTS
sampler is exploring the posterior efficiently.
Pair plots help us assess potential correlations between parameters and they also provide an overview of the joint posterior distributions.
mcmc_pairs(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
Marginal Distributions (Diagonals):
Each parameter’s histogram is roughly unimodal and does not exhibit strong skew or multiple modes. This suggests that the posterior for each parameter is relatively stable and well-identified.
Pairwise Relationships (Off-Diagonals):
The scatter plots show how each parameter’s posterior samples relate to the others. We do not see strong linear or nonlinear patterns, indicating that no pair of parameters is highly correlated. This relative lack of strong correlation is a good sign for efficient sampling and clear parameter interpretation.
Overall Posterior Behavior:
The tight, roughly elliptical scatter clouds in the off-diagonal plots, combined with unimodal marginals, suggest that our model converged well and that the parameters are well estimated. There is no visual evidence of pathological behavior such as multi-modality or heavy tail correlations that could impede inference.
We ran multiple Markov chains in parallel. When each chain converges to the same region of parameter space, it provides evidence that the sampler is exploring the true posterior rather than getting stuck in a local mode. In Stan’s output, the \(\hat{R}\) statistic quantifies the potential scale reduction factor.
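Concretely, for each parameter the split-\(\hat{R}\) compares the between-chain variance estimate \(B\) with the average within-chain variance \(W\), for chains (split in half) of length \(N\): \[ \hat{R} = \sqrt{\frac{\tfrac{N-1}{N}\,W + \tfrac{1}{N}\,B}{W}}, \] which approaches 1 as all chains sample from the same distribution.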
rhat_values <- rhat(fit, pars = c("beta", "mu_alpha", "tau", "sigma"))
print(rhat_values)
beta mu_alpha tau sigma
0.9992642 0.9998036 1.0009017 0.9993727
An \(\hat{R}\) value close to 1 indicates convergence, meaning that the variability between the chains is nearly identical to the variability within each chain. Typically, values below 1.1 are considered acceptable, so our values confirm that our chains have converged well. This gives us confidence that the posterior estimates are reliable and that our sampler has thoroughly explored the target distribution.
The effective sample size (ESS) measures how many nearly independent draws we have relative to the total samples. High autocorrelation within a chain reduces the ESS. Stan reports both an absolute ESS and an ESS ratio. A higher ESS ratio indicates more efficient sampling.
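For a chain of length \(N\) with lag-\(t\) autocorrelation \(\rho_t\), the effective sample size is approximately \[ \text{ESS} = \frac{N}{1 + 2\sum_{t=1}^{\infty} \rho_t}, \] so positive autocorrelation shrinks the ESS below \(N\), while negative (antithetic) autocorrelation can push the ESS ratio above 1.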
ess_values <- neff_ratio(fit, pars = c("beta", "mu_alpha", "tau", "sigma"))
print(ess_values)
beta mu_alpha tau sigma
1.0617332 0.7784023 0.3412737 1.6060289
The ESS ratios for \(\beta\) (1.06) and \(\sigma\) (1.61) are greater than 1, which suggests that these parameters are sampled with high efficiency (possibly even benefiting from negative autocorrelation). In contrast, \(\mu_\alpha\) (0.78) is somewhat lower, and \(\tau\) (0.34) is notably lower, indicating that these chains exhibit higher autocorrelation and provide fewer independent draws. Although lower ESS ratios can still be acceptable depending on the context, they suggest that additional iterations or further tuning might improve the precision of our estimates for these parameters.
Even if the chains converge, strong autocorrelation in the draws can reduce the effective sample size. We can visualize the autocorrelation function (ACF) for each parameter using bayesplot.
posterior_samples <- as.array(fit)
mcmc_acf_bar(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
From the ACF plots, we see that \(\beta\), \(\mu_\alpha\), and \(\sigma\) exhibit a relatively rapid decay
toward zero, indicating that consecutive draws for these parameters are
fairly uncorrelated. In contrast, \(\tau\) shows a slower decay, reflecting
higher autocorrelation. This observation aligns with \(\tau\) having a lower ESS ratio. While the
chains converge overall, the slower decay for \(\tau\) implies that we might need
additional iterations — or a different
parameterization — to achieve a higher number of effectively
independent draws for this parameter.
Density plots help visualize the marginal posterior distributions of each parameter. By overlaying densities from each chain, we can quickly assess whether the chains have converged to the same distribution and identify any irregularities such as multimodality or heavy tails.
posterior_samples <- as.array(fit)
mcmc_dens_overlay(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
These overlaid density plots show that all four chains produce nearly
identical posterior distributions for each parameter, indicating strong
convergence and no evidence of multi-modality. Each parameter’s
distribution appears unimodal, and the curves from different chains
overlap closely. This alignment suggests that the sampler has converged
to a stable region of parameter space and that our posterior estimates
for \(\beta\), \(\mu_\alpha\), \(\tau\), and \(\sigma\) are both consistent and
reliable.
In this section, we explore several potential enhancements to our baseline algorithm. Although our baseline hierarchical model—using a centered parameterization and basic Hamiltonian Monte Carlo (HMC)—performs reasonably well, our primary focus here is to experiment with alternative methods to further improve sampling efficiency and reduce autocorrelation. Our main goal is to enhance the HMC algorithm for the radon data problem.
We will begin by implementing an Enhanced Mass Matrix Adaptation inspired by the Riemannian Manifold HMC (RMHMC) approach, which seeks to capture the local geometry of the posterior distribution more accurately than standard methods. Next, we will consider a non-centered parameterization along with other reparameterization strategies to further reduce posterior correlations. We then plan to explore advanced warm-up strategies by combining NUTS with variational inference techniques, followed by further reparameterization techniques, and finally, hybrid sampling approaches. Each enhancement will be evaluated in terms of convergence diagnostics and overall efficiency.
Since we will run many different models, we set up an automated diagnostics routine to standardize the evaluation of each Stan model run. This routine will:
Extract key convergence metrics (R-hat and effective sample size [ESS]) from the model summary.
Compute model comparison metrics, such as leave-one-out cross-validation (LOO) using the \(loo\) package.
Generate and automatically save diagnostic plots, including trace plots, autocorrelation (ACF) plots, and density overlays.
Below is the function that implements this automated diagnostics routine:
run_diagnostics <- function(fit, model_name = "model", output_folder = "plots", print_plots = TRUE) {
if (!dir.exists(output_folder)) {
dir.create(output_folder)
}
sum_fit <- summary(fit, pars = c("beta", "mu_alpha", "tau", "sigma"))$summary
rhat_values <- sum_fit[, "Rhat"]
ess_values <- sum_fit[, "n_eff"]
loo_fit <- loo(fit)
diag_df <- data.frame(
Parameter = rownames(sum_fit),
Rhat = rhat_values,
ESS = ess_values,
Model = model_name,
LOO_ELPD = loo_fit$estimates["elpd_loo", "Estimate"]
)
posterior_samples <- as.array(fit)
trace_plot <- mcmc_trace(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
if (print_plots) print(trace_plot)
ggsave(file.path(output_folder, paste0(model_name, "_trace_plot.png")),
trace_plot, width = 10, height = 8)
acf_plot <- mcmc_acf_bar(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
if (print_plots) print(acf_plot)
ggsave(file.path(output_folder, paste0(model_name, "_acf_plot.png")),
acf_plot, width = 10, height = 8)
dens_plot <- mcmc_dens_overlay(posterior_samples, pars = c("beta", "mu_alpha", "tau", "sigma"))
if (print_plots) print(dens_plot)
ggsave(file.path(output_folder, paste0(model_name, "_density_overlay.png")),
dens_plot, width = 10, height = 8)
return(diag_df)
}
Application to the Fitted Baseline Model:
baseline_diag <- run_diagnostics(fit, model_name = "baseline_model", print_plots = FALSE)
Now, all key diagnostic metrics and plots for the baseline model are stored and can be used for comparison as we implement and evaluate alternative enhancements.
While NUTS typically uses a diagonal or dense mass matrix for adapting its stepsize and trajectory length, further improvements can be achieved by employing a Riemannian metric. The Riemannian Manifold HMC adapts to the local curvature of the posterior, allowing for more efficient exploration in regions where the geometry is complex.
Stan does not implement a full RMHMC algorithm; however, we can approximate some of its benefits by using a dense mass matrix adaptation. This allows the sampler to capture some of the local correlations in the posterior, which is an important step towards the RMHMC approach.
fit_dense <- stan(file = "radon_model_centered.stan",
data = radon_data,
iter = 2000,
warmup = 1000,
chains = 4,
seed = 123,
control = list(metric = "dense_e", adapt_engaged = TRUE))
Running diagnostics on the fitted model:
dense_diag <- run_diagnostics(fit_dense, model_name = "dense_mass_matrix")
The plots appear very similar to those of the baseline model.
Trace Plots: The chains for each parameter (beta, mu_alpha, tau, sigma) appear well-mixed, with no visible trends or drifts, indicating good convergence.
Autocorrelation Plots: The ACFs show a rapid decay toward zero, suggesting that the draws are relatively independent and thus the sampler is efficient.
Density Overlays: The posterior distributions are unimodal and overlap closely across the four chains, reinforcing that the sampler converged to the same region in parameter space.
Next we compare the model to the baseline Model.
combined_diagnostics <- bind_rows(baseline_diag, dense_diag)
print(combined_diagnostics)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 0.9992642 4246.933 baseline_model -1037.199
mu_alpha...2 mu_alpha 0.9998036 3113.609 baseline_model -1037.199
tau...3 tau 1.0009017 1365.095 baseline_model -1037.199
sigma...4 sigma 0.9993727 6424.116 baseline_model -1037.199
beta...5 beta 1.0003583 6580.948 dense_mass_matrix -1037.346
mu_alpha...6 mu_alpha 0.9997937 6229.065 dense_mass_matrix -1037.346
tau...7 tau 1.0001146 2147.312 dense_mass_matrix -1037.346
sigma...8 sigma 0.9996516 7291.118 dense_mass_matrix -1037.346
As seen in the plots, both models show excellent convergence, with R-hat values very close to 1 and high effective sample sizes. The LOO-ELPD values are nearly identical (approximately -1037.2 for the baseline vs. -1037.3 for the dense mass matrix model), indicating essentially equivalent predictive fit. Notably, the ESS is higher for every parameter in the dense mass matrix model, especially for \(\tau\), indicating improved sampling efficiency compared to the baseline model.
According to Papaspiliopoulos, Roberts, and Sköld (2003), non-centered parameterization can improve sampling efficiency in hierarchical models. Instead of sampling \(\alpha_j\) directly, we introduce a latent variable \(\alpha_{\text{raw}, j}\) that follows a standard normal distribution and then transform it: \[\alpha_j = \mu_\alpha + \alpha_{\text{raw}, j} \cdot \tau.\] To recap, in the centered parameterization, on the other hand, we model the county-level intercepts \(\alpha_j\) directly as being drawn from a common distribution with parameters \(\mu_\alpha\) and \(\tau\).
This model assumes that each county-specific intercept \(\alpha_j\) is drawn from a population distribution centered at \(\mu_\alpha\) with standard deviation \(\tau\), allowing for partial pooling of information across counties. The slope \(\beta\) represents the effect of floor level, and \(\sigma\) is the residual variability within counties.
We call this the centered parameterization because we sample \(\alpha_j\) directly from its normal distribution: \[ \alpha_j \sim \mathcal{N}(\mu_\alpha, \tau^2), \] as opposed to defining a standard normal variable and transforming it (as in the non-centered case).
In the non-centered parameterization, we instead define: \[ \alpha_j = \mu_\alpha + \tau \cdot \alpha_{\text{raw},j}, \quad \text{with} \quad \alpha_{\text{raw},j} \sim \mathcal{N}(0, 1). \] Theoretically, this reparameterization can improve convergence and sampling efficiency, especially when the data are weakly informative about group-level variation (e.g., sparse counties or small \(\tau\)). In contrast, the centered parameterization tends to work better when group-level effects are well-identified from the data (e.g., larger sample size per group, stronger signals).
Note: We built this advanced model formulation in the file radon_model_non-centered.stan.
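That file is not reproduced in the text either. A sketch of the blocks that differ from a centered formulation (the data block and the priors on \(\mu_\alpha\), \(\beta\), \(\tau\), and \(\sigma\) stay the same) could look as follows; this is our reconstruction from the equations above, not necessarily identical to the original file:

```stan
parameters {
  vector[J] alpha_raw;     // standard-normal latent intercepts
  real beta;
  real mu_alpha;
  real<lower=0> tau;
  real<lower=0> sigma;
}
transformed parameters {
  // implied county-specific intercepts
  vector[J] alpha = mu_alpha + tau * alpha_raw;
}
model {
  alpha_raw ~ std_normal();   // replaces alpha ~ normal(mu_alpha, tau)
  mu_alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  tau ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  y ~ normal(alpha[county] + beta * x, sigma);
}
```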
As done before, we load our Stan model (saved as radon_model_non-centered.stan) and run the posterior sampler using NUTS (all as in Section 3):
fit_nc <- stan(file = "radon_model_non-centered.stan",
data = radon_data,
iter = 2000,
warmup = 1000,
chains = 4,
seed = 123)
Running diagnostics on the fitted model:
nc_diag <- run_diagnostics(fit_nc, model_name = "non_centered_parameterization")
From these plots, we can see that the chains mix well
and there are no signs of non-convergence. The trace
plots show stable, overlapping “fuzzy caterpillars,” indicating that the
sampler is exploring a consistent region of the parameter space for each
parameter. The autocorrelation plots reveal a relatively fast
decay, suggesting an efficient sampling process with low
correlation between successive draws. Finally, the density overlays for
each chain align closely, reinforcing that the chains converge to the
same posterior distribution. The only difference we can notice compared
to the baseline model is that the \(\mu_\alpha\) shows a slower decay in this
model.
Next we compare the model to the baseline model.
combined_diagnostics <- bind_rows(baseline_diag, nc_diag)
print(combined_diagnostics)
             Parameter      Rhat      ESS                         Model  LOO_ELPD
beta...1          beta 0.9992642 4246.933                baseline_model -1037.199
mu_alpha...2  mu_alpha 0.9998036 3113.609                baseline_model -1037.199
tau...3            tau 1.0009017 1365.095                baseline_model -1037.199
sigma...4        sigma 0.9993727 6424.116                baseline_model -1037.199
beta...5          beta 0.9999682 7439.011 non_centered_parameterization -1037.368
mu_alpha...6  mu_alpha 0.9999082 2859.178 non_centered_parameterization -1037.368
tau...7            tau 1.0004136 1843.028 non_centered_parameterization -1037.368
sigma...8        sigma 1.0008327 6204.392 non_centered_parameterization -1037.368
Again, both models show excellent convergence diagnostics with Rhat values very close to 1 across all parameters, indicating that the chains have stabilized and mixed well. The ESS values are high for both models, although there are some differences: for example, the non-centered model achieves a higher ESS for \(\beta\), while the ESS for \(\mu_\alpha\) is somewhat lower compared to the baseline. The leave-one-out expected log predictive density (LOO_ELPD) is nearly identical between the two models (approximately -1037.2 for the baseline and -1037.4 for the non-centered model), indicating similar predictive performance. Overall, these results suggest that both parameterizations converge reliably, with the non-centered parameterization offering marginal improvements in sampling efficiency for certain parameters.
One of the next steps in our algorithm enhancement is to automate the choice between parameterizations. Instead of manually deciding whether to use the centered or non-centered formulation, we can set up a routine that compares key diagnostics—such as effective sample size (ESS), R-hat, and LOO-ELPD—from both models and selects the one with better performance. This automatic reparameterization approach can help streamline our workflow by ensuring that the model used for inference is optimally tuned for sampling efficiency.
The basic idea is to:
Run both the baseline (centered) and non-centered models.
Extract the diagnostic metrics using our run_diagnostics() function.
Compare the average ESS (or average Rhat, or LOO_ELPD) and choose the model that exhibits better convergence properties.
Below is a function to demonstrate this comparison:
choose_best_model <- function(diag_baseline, diag_noncentered) {
  avg_ess_baseline     <- mean(diag_baseline$ESS, na.rm = TRUE)
  avg_ess_noncentered  <- mean(diag_noncentered$ESS, na.rm = TRUE)
  avg_rhat_baseline    <- mean(diag_baseline$Rhat, na.rm = TRUE)
  avg_rhat_noncentered <- mean(diag_noncentered$Rhat, na.rm = TRUE)
  cat("Baseline Model - Avg ESS:", avg_ess_baseline, "Avg Rhat:", avg_rhat_baseline, "\n")
  cat("Non-Centered Model - Avg ESS:", avg_ess_noncentered, "Avg Rhat:", avg_rhat_noncentered, "\n")
  if (avg_ess_noncentered > avg_ess_baseline && avg_rhat_noncentered <= avg_rhat_baseline) {
    chosen_model <- "non_centered_parameterization"
  } else {
    chosen_model <- "baseline_model"
  }
  return(chosen_model)
}
After running the diagnostic checks, we can automatically decide the best model:
best_model <- choose_best_model(baseline_diag, nc_diag)
Baseline Model - Avg ESS: 3787.438 Avg Rhat: 0.9998355
Non-Centered Model - Avg ESS: 4586.402 Avg Rhat: 1.000281
cat("The automatically selected best model is:", best_model, "\n")
The automatically selected best model is: baseline_model
For further optimizations, we will run both models (centered and non-centered) and let the routine decide which performs better based on these diagnostics, and then compare the selected model against the baseline.
Since we have already observed that the centered parameterization performs better with a dense mass matrix adaptation, the next step is to verify whether applying the same dense mass matrix to the non-centered parameterization also yields improvements. In other words, we want to determine if the dense mass matrix adaptation enhances the performance of the non-centered model relative to its default (diagonal) version. If both parameterizations benefit from the dense mass matrix, we will adopt it as our default for future comparisons.
To do this, we run the non-centered model using a dense mass matrix adaptation and compare its diagnostics to those from the non-centered model with the default mass matrix.
fit_nc_dense <- stan(file = "radon_model_non-centered.stan",
                     data = radon_data,
                     iter = 2000,
                     warmup = 1000,
                     chains = 4,
                     seed = 123,
                     control = list(metric = "dense_e", adapt_engaged = TRUE))
Run diagnostics for the dense non-centered model:
nc_dense_diag <- run_diagnostics(fit_nc_dense, model_name = "non_centered_dense", print_plots = FALSE)
As before, we observe successful convergence. (The plots are almost identical, which is why we do not present them in this report; they are stored in the ‘plots’ folder.)
Now, we compare the diagnostics with the non-dense non-centered model:
combined_nc_diag <- bind_rows(nc_diag, nc_dense_diag)
print(combined_nc_diag)
             Parameter      Rhat      ESS                         Model  LOO_ELPD
beta...1          beta 0.9999682 7439.011 non_centered_parameterization -1037.368
mu_alpha...2  mu_alpha 0.9999082 2859.178 non_centered_parameterization -1037.368
tau...3            tau 1.0004136 1843.028 non_centered_parameterization -1037.368
sigma...4        sigma 1.0008327 6204.392 non_centered_parameterization -1037.368
beta...5          beta 0.9991695 6040.582            non_centered_dense -1037.506
mu_alpha...6  mu_alpha 0.9998339 5573.401            non_centered_dense -1037.506
tau...7            tau 1.0007901 1713.470            non_centered_dense -1037.506
sigma...8        sigma 0.9998017 5706.046            non_centered_dense -1037.506
Based on these diagnostics, the non-centered parameterization with the default (diagonal) mass matrix appears to perform slightly better overall. While the dense mass matrix version shows improved ESS for \(\mu_\alpha\) and \(\tau\), the default non-centered model achieves higher ESS for \(\beta\) and \(\sigma\) and has a marginally better (less negative) LOO-ELPD. Thus, the default non-centered model offers a slight edge in predictive performance and sampling efficiency.
For further evaluations we will proceed as follows:
For the centered model, the dense mass matrix adaptation improves performance, so we would adopt the dense mass matrix.
For the non-centered model, the default (diagonal) mass matrix appears to work slightly better, so we would use that one.
We then automatically select the best-performing model using our diagnostics:
best_model <- choose_best_model(dense_diag, nc_diag)
Baseline Model - Avg ESS: 5562.111 Avg Rhat: 0.9999796
Non-Centered Model - Avg ESS: 4586.402 Avg Rhat: 1.000281
cat("The automatically selected best model is:", best_model, "\n")
The automatically selected best model is: baseline_model
From now on, we will use the baseline model (centered with a dense mass matrix) as our reference for further comparisons and optimizations.
A good warm-start minimizes the time the sampler spends exploring low-density regions, leading to faster convergence. One strategy to achieve this is to use a variational inference (VI) approximation to initialize the chains for the NUTS sampler. By starting from an area with high posterior probability—as identified by VI—we can reduce the warm-up time and improve overall sampling efficiency.
For Variational Bayes (VB), we first need to compile the Stan model:
model_centered <- stan_model(file = "radon_model_centered.stan")
Then, we run variational inference to obtain initial estimates:
fit_vb <- vb(model_centered, data = radon_data, seed = 123)
Chain 1: ------------------------------------------------------------
Chain 1: EXPERIMENTAL ALGORITHM:
Chain 1: This procedure has not been thoroughly tested and may be unstable
Chain 1: or buggy. The interface is subject to change.
Chain 1: ------------------------------------------------------------
Chain 1:
Chain 1:
Chain 1:
Chain 1: Gradient evaluation took 5.4e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.54 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Begin eta adaptation.
Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
Chain 1: Success! Found best value [eta = 1] earlier than expected.
Chain 1:
Chain 1: Begin stochastic gradient ascent.
Chain 1: iter ELBO delta_ELBO_mean delta_ELBO_med notes
Chain 1: 100 -1090.433 1.000 1.000
Chain 1: 200 -1087.821 0.501 1.000
Chain 1: 300 -1074.842 0.338 0.012
Chain 1: 400 -1078.344 0.254 0.012
Chain 1: 500 -1074.311 0.204 0.004 MEDIAN ELBO CONVERGED
Chain 1:
Chain 1: Drawing a sample of size 1000 from the approximate posterior...
Chain 1: COMPLETED.
Warning: Pareto k diagnostic value is 1.49. Resampling is disabled. Decreasing
tol_rel_obj may help if variational algorithm has terminated prematurely.
Otherwise consider using sampling instead.
When running it as above, we observed issues such as a high Pareto k value, indicating unstable importance sampling. While the preliminary variational inference results provide an acceptable approximation for our initial analysis, our focus on algorithm optimization motivates us to fine-tune the variational parameters. In particular, we decrease the tolerance (tol_rel_obj = 0.001) to improve convergence and reliability. This adjustment forces the algorithm to converge more precisely, albeit potentially at the cost of increased runtime.
fit_vb_finetuned <- vb(model_centered, data = radon_data, seed = 123,
                       tol_rel_obj = 0.001)
Chain 1: ------------------------------------------------------------
Chain 1: EXPERIMENTAL ALGORITHM:
Chain 1: This procedure has not been thoroughly tested and may be unstable
Chain 1: or buggy. The interface is subject to change.
Chain 1: ------------------------------------------------------------
Chain 1:
Chain 1:
Chain 1:
Chain 1: Gradient evaluation took 2.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Begin eta adaptation.
Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
Chain 1: Success! Found best value [eta = 1] earlier than expected.
Chain 1:
Chain 1: Begin stochastic gradient ascent.
Chain 1: iter ELBO delta_ELBO_mean delta_ELBO_med notes
Chain 1: 100 -1090.433 1.000 1.000
Chain 1: 200 -1087.821 0.501 1.000
Chain 1: 300 -1074.842 0.338 0.012
Chain 1: 400 -1078.344 0.254 0.012
Chain 1: 500 -1074.311 0.204 0.004
Chain 1: 600 -1071.617 0.171 0.004
Chain 1: 700 -1071.204 0.146 0.003
Chain 1: 800 -1072.658 0.128 0.003
Chain 1: 900 -1073.649 0.114 0.003
Chain 1: 1000 -1070.130 0.103 0.003
Chain 1: 1100 -1071.737 0.003 0.003
Chain 1: 1200 -1069.720 0.003 0.003
Chain 1: 1300 -1071.948 0.002 0.002
Chain 1: 1400 -1069.975 0.002 0.002
Chain 1: 1500 -1071.793 0.002 0.002
Chain 1: 1600 -1070.020 0.002 0.002
Chain 1: 1700 -1068.508 0.002 0.002
Chain 1: 1800 -1071.285 0.002 0.002
Chain 1: 1900 -1069.822 0.002 0.002
Chain 1: 2000 -1070.350 0.002 0.002
Chain 1: 2100 -1070.497 0.002 0.002
Chain 1: 2200 -1069.051 0.001 0.002
Chain 1: 2300 -1070.701 0.001 0.002
Chain 1: 2400 -1068.992 0.001 0.002
Chain 1: 2500 -1067.730 0.001 0.001
Chain 1: 2600 -1069.366 0.001 0.001
Chain 1: 2700 -1068.318 0.001 0.001
Chain 1: 2800 -1067.358 0.001 0.001
Chain 1: 2900 -1068.609 0.001 0.001
Chain 1: 3000 -1067.317 0.001 0.001
Chain 1: 3100 -1069.065 0.001 0.001
Chain 1: 3200 -1069.441 0.001 0.001
Chain 1: 3300 -1069.998 0.001 0.001
Chain 1: 3400 -1068.241 0.001 0.001
Chain 1: 3500 -1068.856 0.001 0.001
Chain 1: 3600 -1068.709 0.001 0.001 MEAN ELBO CONVERGED MEDIAN ELBO CONVERGED
Chain 1:
Chain 1: Drawing a sample of size 1000 from the approximate posterior...
Chain 1: COMPLETED.
Warning: Pareto k diagnostic value is 0.81. Resampling is unreliable.
Increasing the number of draws or decreasing tol_rel_obj may help.
This initial fine-tuning was not completely sufficient, so we decided to both increase the number of posterior draws and further reduce the tolerance (tol_rel_obj). We iterated on these settings several times until we found a sweet spot, balancing computational efficiency against the quality of the variational approximation.
fit_vb_more_tuned <- vb(model_centered, data = radon_data, seed = 123,
                        tol_rel_obj = 0.0001,
                        output_samples = 20000)
Chain 1: ------------------------------------------------------------
Chain 1: EXPERIMENTAL ALGORITHM:
Chain 1: This procedure has not been thoroughly tested and may be unstable
Chain 1: or buggy. The interface is subject to change.
Chain 1: ------------------------------------------------------------
Chain 1:
Chain 1:
Chain 1:
Chain 1: Gradient evaluation took 2.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Begin eta adaptation.
Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
Chain 1: Success! Found best value [eta = 1] earlier than expected.
Chain 1:
Chain 1: Begin stochastic gradient ascent.
Chain 1: iter ELBO delta_ELBO_mean delta_ELBO_med notes
Chain 1: 100 -1090.433 1.000 1.000
Chain 1: 200 -1087.821 0.501 1.000
Chain 1: 300 -1074.842 0.338 0.012
Chain 1: 400 -1078.344 0.254 0.012
Chain 1: 500 -1074.311 0.204 0.004
Chain 1: 600 -1071.617 0.171 0.004
Chain 1: 700 -1071.204 0.146 0.003
Chain 1: 800 -1072.658 0.128 0.003
Chain 1: 900 -1073.649 0.114 0.003
Chain 1: 1000 -1070.130 0.103 0.003
Chain 1: 1100 -1071.737 0.003 0.003
Chain 1: 1200 -1069.720 0.003 0.003
Chain 1: 1300 -1071.948 0.002 0.002
Chain 1: 1400 -1069.975 0.002 0.002
Chain 1: 1500 -1071.793 0.002 0.002
Chain 1: 1600 -1070.020 0.002 0.002
Chain 1: 1700 -1068.508 0.002 0.002
Chain 1: 1800 -1071.285 0.002 0.002
Chain 1: 1900 -1069.822 0.002 0.002
Chain 1: 2000 -1070.350 0.002 0.002
Chain 1: 2100 -1070.497 0.002 0.002
Chain 1: 2200 -1069.051 0.001 0.002
Chain 1: 2300 -1070.701 0.001 0.002
Chain 1: 2400 -1068.992 0.001 0.002
Chain 1: 2500 -1067.730 0.001 0.001
Chain 1: 2600 -1069.366 0.001 0.001
Chain 1: 2700 -1068.318 0.001 0.001
Chain 1: 2800 -1067.358 0.001 0.001
Chain 1: 2900 -1068.609 0.001 0.001
Chain 1: 3000 -1067.317 0.001 0.001
Chain 1: 3100 -1069.065 0.001 0.001
Chain 1: 3200 -1069.441 0.001 0.001
Chain 1: 3300 -1069.998 0.001 0.001
Chain 1: 3400 -1068.241 0.001 0.001
Chain 1: 3500 -1068.856 0.001 0.001
Chain 1: 3600 -1068.709 0.001 0.001
Chain 1: 3700 -1068.110 0.001 0.001
Chain 1: 3800 -1067.853 0.001 0.001
Chain 1: 3900 -1066.737 0.001 0.001
Chain 1: 4000 -1068.905 0.001 0.001
Chain 1: 4100 -1068.715 0.001 0.001
Chain 1: 4200 -1067.389 0.001 0.001
Chain 1: 4300 -1068.991 0.001 0.001
Chain 1: 4400 -1067.184 0.001 0.001
Chain 1: 4500 -1069.332 0.001 0.001
Chain 1: 4600 -1067.307 0.001 0.001
Chain 1: 4700 -1068.112 0.001 0.001
Chain 1: 4800 -1069.995 0.001 0.002
Chain 1: 4900 -1067.945 0.001 0.002
Chain 1: 5000 -1067.525 0.001 0.002
Chain 1: 5100 -1068.049 0.001 0.002
Chain 1: 5200 -1068.193 0.001 0.002
Chain 1: 5300 -1068.350 0.001 0.002
Chain 1: 5400 -1067.607 0.001 0.001
Chain 1: 5500 -1067.201 0.001 0.001
Chain 1: 5600 -1066.427 0.001 0.001
Chain 1: 5700 -1068.660 0.001 0.001
Chain 1: 5800 -1068.066 0.001 0.001
Chain 1: 5900 -1067.009 0.001 0.001
Chain 1: 6000 -1068.278 0.001 0.001
Chain 1: 6100 -1069.393 0.001 0.001
Chain 1: 6200 -1067.846 0.001 0.001
Chain 1: 6300 -1069.090 0.001 0.001
Chain 1: 6400 -1067.135 0.001 0.001
Chain 1: 6500 -1069.322 0.001 0.001
Chain 1: 6600 -1067.309 0.001 0.001
Chain 1: 6700 -1067.390 0.001 0.001
Chain 1: 6800 -1068.453 0.001 0.001
Chain 1: 6900 -1067.043 0.001 0.001
Chain 1: 7000 -1067.157 0.001 0.001
Chain 1: 7100 -1068.944 0.001 0.001
Chain 1: 7200 -1067.935 0.001 0.001
Chain 1: 7300 -1067.535 0.001 0.001
Chain 1: 7400 -1066.644 0.001 0.001
Chain 1: 7500 -1068.259 0.001 0.001
Chain 1: 7600 -1068.510 0.001 0.001
Chain 1: 7700 -1068.019 0.001 0.001
Chain 1: 7800 -1066.575 0.001 0.001
Chain 1: 7900 -1067.222 0.001 0.001
Chain 1: 8000 -1067.513 0.001 0.001
Chain 1: 8100 -1067.825 0.001 0.001
Chain 1: 8200 -1067.999 0.001 0.000
Chain 1: 8300 -1066.776 0.001 0.001
Chain 1: 8400 -1068.190 0.001 0.001
Chain 1: 8500 -1068.538 0.001 0.000
Chain 1: 8600 -1067.979 0.001 0.001
Chain 1: 8700 -1066.506 0.001 0.001
Chain 1: 8800 -1067.501 0.001 0.001
Chain 1: 8900 -1067.452 0.001 0.001
Chain 1: 9000 -1066.848 0.001 0.001
Chain 1: 9100 -1067.680 0.001 0.001
Chain 1: 9200 -1068.053 0.001 0.001
Chain 1: 9300 -1067.837 0.001 0.001
Chain 1: 9400 -1068.229 0.001 0.001
Chain 1: 9500 -1066.716 0.001 0.001
Chain 1: 9600 -1068.545 0.001 0.001
Chain 1: 9700 -1067.238 0.001 0.001
Chain 1: 9800 -1067.851 0.001 0.001
Chain 1: 9900 -1066.748 0.001 0.001
Chain 1: 10000 -1067.802 0.001 0.001
Chain 1: Informational Message: The maximum number of iterations is reached! The algorithm may not have converged.
Chain 1: This variational approximation is not guaranteed to be meaningful.
Chain 1:
Chain 1: Drawing a sample of size 20000 from the approximate posterior...
Chain 1: COMPLETED.
Next, we extract summary statistics from the fine-tuned VI approximation:
vi_summary <- summary(fit_vb_more_tuned)$summary
We then prepare a function that returns a list of initial values for the sampler. In this function, each county-specific intercept is initialized around the global intercept:
init_fun <- function() {
  list(
    mu_alpha = vi_summary["mu_alpha", "mean"],
    tau      = vi_summary["tau", "mean"],
    alpha    = rep(vi_summary["mu_alpha", "mean"], radon_data$J),
    beta     = vi_summary["beta", "mean"],
    sigma    = vi_summary["sigma", "mean"]
  )
}
(Note: This initialization strategy is based on our model specifics and may be refined further if necessary.)
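One such refinement, sketched here as an assumption on our part (make_jittered_init and all numeric values are illustrative, not part of our pipeline), is to add small per-chain jitter around the VI means so that the four chains do not all start from the identical point:

```r
# Variant of the warm-start initializer: small Gaussian jitter around the VI
# means, so each chain starts from a slightly different high-probability point.
make_jittered_init <- function(center, jitter_sd = 0.05) {
  function() {
    list(
      mu_alpha = center$mu_alpha + rnorm(1, 0, jitter_sd),
      tau      = abs(center$tau  + rnorm(1, 0, jitter_sd)),  # keep the scale positive
      alpha    = center$alpha + rnorm(length(center$alpha), 0, jitter_sd),
      beta     = center$beta  + rnorm(1, 0, jitter_sd),
      sigma    = abs(center$sigma + rnorm(1, 0, jitter_sd))  # keep the scale positive
    )
  }
}

# Illustrative center values (in practice these would come from the VI summary):
center <- list(mu_alpha = 1.4, tau = 0.3, alpha = rep(1.4, 85), beta = -0.6, sigma = 0.75)
set.seed(123)
init_chain <- make_jittered_init(center)()
```

Passing make_jittered_init(center) as the init argument to stan() would then give each chain its own perturbed starting point.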
Now, we run Stan using NUTS with these warm-start initial values:
fit_warm <- stan(file = "radon_model_centered.stan",
                 data = radon_data,
                 init = init_fun,
                 iter = 2000,
                 warmup = 1000,
                 chains = 4,
                 seed = 123,
                 control = list(metric = "dense_e", adapt_engaged = TRUE))
We expect the convergence plots to be very similar; we still present them, however, since the goal of this enhancement was faster convergence.
Running diagnostics on the fitted model:
wu_diag <- run_diagnostics(fit_warm, model_name = "centered_dense_warm")
From these plots, we see that the chains for all parameters (beta,
mu_alpha, tau, and sigma) remain well-mixed, with no clear trends or
drifts, indicating good convergence. The autocorrelation decays quickly
across all parameters, suggesting that the sampler efficiently produces
near-independent draws. The density overlays from each chain align
closely, reinforcing that all chains have converged to the same
posterior distribution. Overall, these diagnostics look nearly identical
to the earlier runs—consistent with our expectation that the main
benefit here is faster convergence, not a change in the final posterior
estimates.
Next, we compare the model with a warm-start to the baseline dense model that was run without a warm-start.
combined_diagnostics <- bind_rows(dense_diag, wu_diag)
print(combined_diagnostics)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 1.0003583 6580.948 dense_mass_matrix -1037.346
mu_alpha...2 mu_alpha 0.9997937 6229.065 dense_mass_matrix -1037.346
tau...3 tau 1.0001146 2147.312 dense_mass_matrix -1037.346
sigma...4 sigma 0.9996516 7291.118 dense_mass_matrix -1037.346
beta...5 beta 0.9997020 6495.029 centered_dense_warm -1036.921
mu_alpha...6 mu_alpha 0.9996510 6477.375 centered_dense_warm -1036.921
tau...7 tau 1.0000870 2219.838 centered_dense_warm -1036.921
sigma...8 sigma 0.9998895 6474.490 centered_dense_warm -1036.921
Both models show nearly identical convergence and predictive performance, with R-hat values close to 1 and similar LOO-ELPD values. The warm-start version makes no significant difference to the final estimates but can reduce the time to convergence, making it a practical enhancement when runtime is a concern. Therefore, for further enhancements we will use the baseline dense model without a warm-start.
Although parallel sampling is relatively straightforward to implement — since Stan already runs multiple chains concurrently — it may not provide significant additional benefits for our current model. In contrast, distributed sampling is typically reserved for very large models or datasets. Given our diagnostic results and our focus on algorithmic improvements, we will concentrate on further reparameterization techniques, where we expect to achieve more substantial gains in sampling efficiency and convergence.
We have already implemented the fully non-centered parameterization, which can improve convergence when group-level effects are weakly informed by the data. Further reparameterization techniques may yield additional improvements. We now consider three additional strategies:
Partially Non-Centered Parameterization
Log-Scale Reparameterization for Scale Parameters
Standardizing Predictors
In a fully centered parameterization, group-level effects are modeled directly (i.e., \(\alpha_j \sim \mathcal{N}(\mu_\alpha, \tau^2)\)), while in a fully non-centered parameterization we define \[\alpha_j = \mu_\alpha + \tau \cdot \alpha_{\text{raw},j}, \quad \text{with } \quad \alpha_{\text{raw},j} \sim \mathcal{N}(0,1).\] However, neither extreme may be optimal when data across groups vary in informativeness. A partially non-centered parameterization combines both approaches by introducing a tuning parameter, \(\phi\), that controls the blend between the centered and non-centered formulations. For instance, we can model: \[\alpha_j = \mu_\alpha + \phi \, \tau \, \alpha_{\text{raw},j} + (1 - \phi) \, \delta_j,\] where \(\alpha_{\text{raw},j} \sim \mathcal{N}(0,1)\) represents the non-centered component and \(\delta_j\) represents the centered deviation (with an appropriate scale, such as \(\delta_j \sim \mathcal{N}(0, \tau)\)). For simplicity, we fix \(\phi = 0.5\) to give equal weight to both components (see also Gelman et al., 2013).
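One detail worth noting (our observation, not a claim about the Stan file): because the two components are independent, this particular blend implies \(\operatorname{Var}(\alpha_j) = (\phi^2 + (1-\phi)^2)\,\tau^2\), which is smaller than \(\tau^2\) for \(0 < \phi < 1\). A short R sketch with illustrative values confirms this numerically:

```r
# Implied marginal scale of alpha_j under the partial non-centering blend:
# alpha_j = mu_alpha + phi * tau * alpha_raw + (1 - phi) * delta_j,
# with alpha_raw ~ N(0, 1) and delta_j ~ N(0, tau) independent.
set.seed(123)
mu_alpha <- 1.5; tau <- 0.3; phi <- 0.5     # arbitrary illustrative values
alpha_raw <- rnorm(1e5)
delta     <- rnorm(1e5, mean = 0, sd = tau)
alpha     <- mu_alpha + phi * tau * alpha_raw + (1 - phi) * delta
sd(alpha)                                   # empirical marginal sd
tau * sqrt(phi^2 + (1 - phi)^2)             # theoretical value
```

With \(\phi = 0.5\) the implied prior scale is \(\tau/\sqrt{2}\), so the blend slightly tightens the implied group-level prior relative to the two pure parameterizations.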
Note: We built this advanced model formulation in the file radon_model_partial_non_centered.stan.
As before, we load our Stan model and run it with the sampler described above:
fit_partial_nc <- stan(file = "radon_model_partial_non_centered.stan",
                       data = radon_data,
                       iter = 2000,
                       warmup = 1000,
                       chains = 4,
                       seed = 123)
Running diagnostics on the fitted model:
partial_nc_diag <- run_diagnostics(fit_partial_nc, model_name = "partial_non_centered")
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
The partial non-centered model appears to have converged successfully:
the chains are well-mixed, the R-hat values are near 1, and the
autocorrelation for most parameters decays quickly. However, \(\tau\) exhibits a slower autocorrelation
decay than in previous models. This is a common challenge for
group-level scale parameters, and we may explore additional strategies
to improve sampling efficiency for \(\tau\) in future work (though for
this project it is just a side note, since \(\tau\) still decays
relatively fast).
We compare this model to the baseline dense model:
combined_diagnostics <- bind_rows(dense_diag, partial_nc_diag)
print(combined_diagnostics)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 1.0003583 6580.9484 dense_mass_matrix -1037.346
mu_alpha...2 mu_alpha 0.9997937 6229.0651 dense_mass_matrix -1037.346
tau...3 tau 1.0001146 2147.3121 dense_mass_matrix -1037.346
sigma...4 sigma 0.9996516 7291.1183 dense_mass_matrix -1037.346
beta...5 beta 0.9995155 4737.6243 partial_non_centered -1036.778
mu_alpha...6 mu_alpha 1.0000713 3113.4060 partial_non_centered -1036.778
tau...7 tau 1.0024624 972.1672 partial_non_centered -1036.778
sigma...8 sigma 1.0001266 4530.3064 partial_non_centered -1036.778
Overall, both models show good convergence (R-hat near 1) and comparable sampling efficiency, though \(\tau\) in the partially non-centered model has a markedly lower ESS, indicating slower mixing for that parameter. The LOO-ELPD values suggest that the partially non-centered model may offer a slightly better fit (-1036.78 vs. -1037.35), but the improvement is modest.
In this reparameterization, we transform the scale parameters \(\tau\) and \(\sigma\) onto the log scale. Instead of sampling \(\tau\) and \(\sigma\) directly, we sample their logarithms (i.e. log_tau and log_sigma), and then transform them back using the exponential function. This approach can stabilize the sampling by reducing skewness and improving mixing, especially when the scale parameters span several orders of magnitude. See Gelman et al. (2013) for details on log-scale reparameterization in hierarchical models.
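A minimal R sketch of the idea outside Stan (values illustrative): sampling on the log scale keeps the unconstrained draws symmetric, while the back-transformed scale parameter is positive by construction:

```r
# Log-scale reparameterization: sample log_tau on the unconstrained real line,
# then recover tau = exp(log_tau), which is positive by construction.
set.seed(123)
log_tau <- rnorm(1e4, mean = log(0.3), sd = 0.5)  # illustrative values
tau <- exp(log_tau)
all(tau > 0)        # positivity is automatic: no hard constraint boundary
```

As a general point (we do not show how radon_model_log_scale.stan handles it): if the prior is stated on \(\tau\) itself while the sampler works with \(\log \tau\), the change of variables requires a Jacobian adjustment of \(+\log \tau\) to the target density; placing the prior directly on \(\log \tau\) avoids this.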
Note: We implemented this reparameterization in a new Stan model, saved as radon_model_log_scale.stan.
Next, we run this model using Stan:
fit_log_scale <- stan(file = "radon_model_log_scale.stan",
                      data = radon_data,
                      iter = 2000,
                      warmup = 1000,
                      chains = 4,
                      seed = 123)
Running diagnostics on the fitted model:
log_scale_diag <- run_diagnostics(fit_log_scale, model_name = "log_scale")
The diagnostic plots show well-mixed chains with Rhat values near 1 and
a rapid decay in autocorrelation for most parameters (again \(\tau\) a little slower decay).
We then compare the log-scale reparameterized model to the baseline dense model:
combined_diag_log <- bind_rows(dense_diag, log_scale_diag)
print(combined_diag_log)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 1.0003583 6580.948 dense_mass_matrix -1037.346
mu_alpha...2 mu_alpha 0.9997937 6229.065 dense_mass_matrix -1037.346
tau...3 tau 1.0001146 2147.312 dense_mass_matrix -1037.346
sigma...4 sigma 0.9996516 7291.118 dense_mass_matrix -1037.346
beta...5 beta 1.0001960 3438.816 log_scale -1036.971
mu_alpha...6 mu_alpha 1.0003860 2088.606 log_scale -1036.971
tau...7 tau 1.0019821 1236.616 log_scale -1036.971
sigma...8 sigma 1.0006646 4574.819 log_scale -1036.971
Both models exhibit excellent convergence with Rhat values near 1. The log-scale model shows a marginal improvement in LOO-ELPD (-1036.97 vs. -1037.35), although its effective sample sizes for \(\beta\) and \(\mu_\alpha\) are noticeably lower than in the baseline dense model. Overall, the predictive performance is nearly identical, suggesting that reparameterizing the scale parameters on the log scale provides comparable results, with some potential benefit in stabilizing the estimation of scale parameters.
Standardizing predictors is a common pre-processing step that can improve the numerical stability of a model and reduce parameter correlations. By centering and scaling the predictor (in our case, x), we allow the sampler to explore the parameter space more efficiently, which can lead to better convergence diagnostics and more robust estimates. See Gelman et al. (2013, Chapter 3) for details on standardizing predictors in regression models.
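The transformation, and the mapping of the fitted slope back to the original scale, can be sketched in R with a synthetic predictor (not our radon data; all values illustrative):

```r
# Standardize a predictor, fit on the z-scale, and recover the original-scale slope.
set.seed(123)
x <- rnorm(200, mean = 10, sd = 4)        # synthetic predictor
y <- 2 + 0.5 * x + rnorm(200, sd = 1)     # synthetic response with true slope 0.5
x_std <- (x - mean(x)) / sd(x)            # centered and scaled predictor
fit_lm    <- lm(y ~ x_std)                # fit on the standardized scale
beta_std  <- coef(fit_lm)[["x_std"]]      # slope per 1 sd of x
beta_orig <- beta_std / sd(x)             # slope per original unit of x
beta_orig                                 # close to 0.5
```

In our Stan model the same standardization happens in the transformed data block, so the sampler works on the well-scaled predictor while estimates can be reported on either scale.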
Note: We implemented the standardization directly in the Stan model’s transformed data block, and saved it in radon_model_standardized.stan.
Next, we run this model using Stan:
fit_std <- stan(file = "radon_model_standardized.stan",
                data = radon_data,
                iter = 2000,
                warmup = 1000,
                chains = 4,
                seed = 123)
Running diagnostics on the fitted model:
std_diag <- run_diagnostics(fit_std, model_name = "standardized")
Once again, the diagnostic plots indicate that the chains are
well-mixed, with trace plots showing stable, overlapping patterns and
autocorrelation plots exhibiting rapid decay. Overall, the density
overlays align closely across chains, confirming excellent convergence
for the standardized model.
We then compare the diagnostics from the standardized model to our baseline dense model:
combined_diag_std <- bind_rows(dense_diag, std_diag)
print(combined_diag_std)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 1.0003583 6580.948 dense_mass_matrix -1037.346
mu_alpha...2 mu_alpha 0.9997937 6229.065 dense_mass_matrix -1037.346
tau...3 tau 1.0001146 2147.312 dense_mass_matrix -1037.346
sigma...4 sigma 0.9996516 7291.118 dense_mass_matrix -1037.346
beta...5 beta 0.9999749 7189.826 standardized -1036.983
mu_alpha...6 mu_alpha 0.9993500 4511.283 standardized -1036.983
tau...7 tau 1.0013902 1617.910 standardized -1036.983
sigma...8 sigma 1.0004383 5228.265 standardized -1036.983
Both models exhibit excellent convergence, with R-hat values nearly 1 and high effective sample sizes. The standardized model has comparable LOO-ELPD and similar diagnostic metrics to the baseline dense model, indicating that standardizing predictors maintains convergence performance while potentially improving sampling efficiency for some parameters.
In this subsection, we explore the impact of using more robust prior distributions and perform sensitivity analysis. Robust priors, such as a half-t distribution, can better accommodate outliers and heavy-tailed behavior compared to half-Cauchy priors. By assessing the sensitivity of our model to these alternative priors, we can evaluate how much our posterior inferences depend on prior choices and ensure that our conclusions are robust. This approach is based on Gelman (2006, p. 515–533), which discusses using half‑t priors as robust alternatives for variance parameters in hierarchical models.
For example, instead of using half-Cauchy priors for the scale parameters (\(\tau\) and \(\sigma\)), we can use half-t priors with a small number of degrees of freedom (e.g., 3). The half-t prior is specified by sampling from a Student’s t distribution and truncating it at zero. There are, of course, many other options, but for the feasibility of this project we decided to rely on this alternative.
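A short R sketch of how such a prior can be simulated (the scale value is illustrative): a half-t with \(\nu\) degrees of freedom is simply the absolute value of a scaled Student-t draw, i.e., the truncation at zero arises by construction:

```r
# Half-t(3) prior draws for a scale parameter: |s * t_nu|.
set.seed(123)
nu <- 3; s <- 1                        # degrees of freedom; illustrative scale
tau_draws <- abs(s * rt(1e5, df = nu))
all(tau_draws >= 0)                    # support is the non-negative half-line
quantile(tau_draws, c(0.5, 0.95))      # heavier tail than half-normal,
                                       # lighter than half-Cauchy
```

This tail behavior is what makes the half-t a robust compromise: it still allows occasional large scale values without the extremely heavy tail of the half-Cauchy.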
Note: We implemented the robust prior version of our model in Stan, saved as radon_model_robust.stan.
Next, we run this model using Stan:
fit_robust <- stan(file = "radon_model_robust.stan",
                   data = radon_data,
                   iter = 2000,
                   warmup = 1000,
                   chains = 4,
                   seed = 123)
Running diagnostics on the fitted model:
robust_diag <- run_diagnostics(fit_robust, model_name = "robust_priors")
The trace plots show stable, overlapping “fuzzy caterpillars” for all
parameters, indicating good mixing and no signs of drift. The
autocorrelation plots reveal a rapid decay, suggesting an efficient
sampling process with minimal correlation between successive draws. The
density overlays confirm that all chains converge to the same posterior
distribution, supporting the conclusion that the robust priors model
converges well.
We then compare the robust model to our baseline dense model:
combined_diag_robust <- bind_rows(dense_diag, robust_diag)
print(combined_diag_robust)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 1.0003583 6580.948 dense_mass_matrix -1037.346
mu_alpha...2 mu_alpha 0.9997937 6229.065 dense_mass_matrix -1037.346
tau...3 tau 1.0001146 2147.312 dense_mass_matrix -1037.346
sigma...4 sigma 0.9996516 7291.118 dense_mass_matrix -1037.346
beta...5 beta 1.0003832 4386.104 robust_priors -1036.889
mu_alpha...6 mu_alpha 1.0009007 3330.298 robust_priors -1036.889
tau...7 tau 0.9991568 1504.899 robust_priors -1036.889
sigma...8 sigma 0.9996653 5224.809 robust_priors -1036.889
Both models show good convergence, with R-hat values near 1. The robust priors model has slightly lower ESS for \(\tau\), indicating that it mixes a bit more slowly for this parameter, but it achieves a marginally better LOO-ELPD (-1036.89 vs. -1037.35). Overall, the robust priors approach offers similar convergence with a modest improvement in predictive performance.
Further enhancements are possible: for example, enhanced step-size and trajectory adaptation can be beneficial when diagnostics reveal issues such as divergences or inefficient sampling trajectories (see Hoffman & Gelman, 2014, p. 1593ff, and the Stan Reference Manual). However, since our current diagnostics indicate excellent convergence and no significant problems with step-size adaptation, further adjustments in this area appear unnecessary. Consequently, we will proceed with one final enhancement strategy in the next subsection.
In complex models that include both continuous and discrete parameters, sampling using a single algorithm can be challenging. HMC/NUTS (implemented in Stan) is highly efficient for continuous parameters but cannot directly sample discrete variables. To address this, we employ a hybrid sampling approach that integrates HMC/NUTS for the continuous components with Gibbs sampling for the discrete components.
In our model, we assume the presence of a discrete latent variable \(Z\) (e.g., representing cluster assignments or state indicators) alongside the continuous parameters \(\theta = \{\mu_\alpha, \beta, \tau, \alpha, \sigma\}\). The idea is to alternate between:
HMC/NUTS Sampling: Update the continuous parameters conditional on the current values of \(Z\).
Gibbs Sampling: Update the discrete variable \(Z\) from its full conditional distribution given the current continuous parameters.
This combined approach leverages the strengths of each method, ensuring efficient exploration of the continuous parameter space while appropriately handling discrete variables. For further reading on hybrid MCMC methods, see Robert and Casella (2004), which provide a comprehensive discussion on combining different sampling techniques.
Note: We implemented this model in Stan, saved as model_continuous.stan.
In the following, we outline a hybrid sampling routine. This routine:
Initializes the discrete latent variable \(Z\).
For a fixed number of iterations, updates the continuous parameters using Stan (via NUTS) conditional on the current \(Z\), and then updates \(Z\) via Gibbs sampling from its full conditional distribution. In our model, the likelihood for each observation is given by \[y_n \sim \mathcal{N}(m_n + 0.1\, Z_n, \sigma^2),\]
where
\[m_n = \alpha[\text{county}[n]] + \beta \, x[n].\]
Assume that each \(Z_n\) has a uniform prior over a finite set \(\{1, 2, \ldots, K\}\) (here, \(K=3\) since we expect three latent states). Then the full conditional for \(Z_n\) is proportional to the likelihood (since the prior is uniform): \[P(Z_n = z \mid \cdot) \propto \exp\left\{ -\frac{1}{2\sigma^2} \left(y_n - \left(m_n + 0.1\, z\right)\right)^2 \right\}, \quad z = 1,\ldots, K.\]
To update \(Z_n\), we compute this unnormalized probability for each possible value of \(z\), normalize these probabilities so they sum to one, and sample from the resulting discrete distribution.
Now, we implement this Gibbs update for \(Z\):
update_Z <- function(y, m, sigma, K = 3) {
  Z_new <- numeric(length(y))
  for (n in seq_along(y)) {
    # Unnormalized full-conditional probabilities for z = 1, ..., K
    probs <- sapply(1:K, function(z) {
      exp(-0.5 * ((y[n] - (m[n] + 0.1 * z))^2) / (sigma^2))
    })
    # Normalize and draw the new state from the discrete distribution
    probs <- probs / sum(probs)
    Z_new[n] <- sample(1:K, size = 1, prob = probs)
  }
  return(Z_new)
}
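As a quick sanity check (not part of the report's pipeline), the full-conditional update can be exercised on simulated data. The chunk below inlines the same computation as update_Z and uses a deliberately small toy value of \(\sigma\) so the three states are well separated and recoverable:

```r
# Simulate data with known latent states, run one Gibbs update,
# and confirm the states are recovered.
set.seed(42)
N <- 200; K <- 3; sigma <- 0.02     # toy sigma, small so states are separable
z_true <- sample(1:K, N, replace = TRUE)
m <- rep(0, N)                       # zero mean structure for the toy example
y <- rnorm(N, mean = m + 0.1 * z_true, sd = sigma)

Z_new <- numeric(N)
for (n in 1:N) {
  # Unnormalized full-conditional probabilities, then normalized draw
  probs <- exp(-0.5 * ((y[n] - (m[n] + 0.1 * (1:K)))^2) / sigma^2)
  Z_new[n] <- sample(1:K, 1, prob = probs / sum(probs))
}
mean(Z_new == z_true)                # close to 1: states recovered
```

With the report's actual \(\sigma \approx 0.7\) and a state spacing of 0.1, individual states are of course much less identifiable; this toy setting only verifies that the update targets the right conditional.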
As described, this code alternates between running Stan to update continuous parameters (conditional on \(Z\)) and updating \(Z\) via Gibbs sampling.
Initializing \(Z\):
K <- 3
radon_data$Z <- rep(1, radon_data$N)
We set parameters for the hybrid sampler:
n_outer_iter <- 50
n_stan_iter <- 1500
n_stan_warmup <- 500
We have a list to store results from each Gibbs iteration:
hybrid_results <- list()
Z_current <- radon_data$Z
Implementing the updating procedure:
for (i in 1:n_outer_iter) {
  cat("Gibbs iteration:", i, "\n")
  # Pass the current discrete states into the data list
  radon_data_updated <- radon_data
  radon_data_updated$Z <- Z_current
  # Continuous update: fit the Stan model conditional on Z
  fit_cont <- stan(file = "model_continuous.stan",
                   data = radon_data_updated,
                   iter = n_stan_iter,
                   warmup = n_stan_warmup,
                   chains = 4,
                   seed = 123)
  cont_summary <- summary(fit_cont)$summary
  mu_alpha_est <- cont_summary["mu_alpha", "mean"]
  beta_est <- cont_summary["beta", "mean"]
  sigma_est <- cont_summary["sigma", "mean"]
  # Discrete update: Gibbs step for Z given the continuous estimates
  m_n <- mu_alpha_est + beta_est * radon_data$x
  Z_current <- update_Z(y = radon_data$y, m = m_n, sigma = sigma_est, K = K)
  hybrid_results[[i]] <- list(mu_alpha = mu_alpha_est, beta = beta_est,
                              sigma = sigma_est, Z = Z_current)
}
Note: For simplicity, we approximate \(m_n\) using mu_alpha and beta; in practice we could also extract each county-level alpha from fit_cont, but this would have required substantially more computation time.
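If one did want the county-specific intercepts, they can be pulled from the Stan summary by matching the alpha[...] row names. The sketch below mocks a tiny summary matrix to illustrate the indexing; the row names, the county index vector, and the numeric values are assumptions for illustration, not output from our fit:

```r
# Mock of summary(fit_cont)$summary restricted to the columns we need
cont_summary <- matrix(
  c(1.2, 1.5, 0.9, -0.7), ncol = 1,
  dimnames = list(c("alpha[1]", "alpha[2]", "alpha[3]", "beta"), "mean")
)

# Select the alpha[...] rows and index them by each observation's county
alpha_rows <- grep("^alpha\\[", rownames(cont_summary))
alpha_est  <- cont_summary[alpha_rows, "mean"]
county     <- c(1, 3, 2, 1)   # toy county indices
x          <- c(0, 1, 0, 1)   # toy floor indicator
m_n <- alpha_est[county] + cont_summary["beta", "mean"] * x
m_n                           # per-observation means with county intercepts
```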
Note: The output of this code chunk is hidden to avoid unnecessarily long printouts in the HTML and to keep the report concise. The full output can be seen in the .Rmd file.
We combine results for analysis:
hybrid_results_df <- do.call(rbind, lapply(hybrid_results, as.data.frame))
print(hybrid_results_df)
Note: The output of this code chunk is hidden to avoid unnecessarily long printouts in the HTML and to keep the report concise. The full output can be seen in the .Rmd file.
We can use this dataframe to see how the continuous parameters and discrete latent variables evolve across iterations, which the plot below shows more clearly. We observe meaningful variation in all parameters, confirming that the Stan fits respond to the updated values of \(Z\) and that we are successfully exploring the posterior.
hybrid_results_df$iteration <- 1:nrow(hybrid_results_df)
# Reshape to long format
hybrid_long <- pivot_longer(
  hybrid_results_df,
  cols = c(mu_alpha, beta, sigma),
  names_to = "parameter",
  values_to = "value"
)
# Plot all parameter trajectories side by side
ggplot(hybrid_long, aes(x = iteration, y = value)) +
  geom_line(color = "steelblue") +
  facet_wrap(~ parameter, scales = "free_y") +
  labs(
    title = "Posterior Mean Estimates Over Hybrid Sampling Iterations",
    x = "Iteration",
    y = "Posterior Mean"
  ) +
  theme_minimal()
After running the hybrid sampler, we can extract the continuous parameter samples from the final Stan fit and run our automated diagnostics:
final_fit <- fit_cont
hybrid_diag <- run_diagnostics(final_fit, model_name = "hybrid_continuous")
print(hybrid_diag)
Parameter Rhat ESS Model LOO_ELPD
beta beta 0.9996800 4366.170 hybrid_continuous -1032.386
mu_alpha mu_alpha 0.9994333 3444.996 hybrid_continuous -1032.386
tau tau 1.0038672 1350.484 hybrid_continuous -1032.386
sigma sigma 0.9996509 4997.410 hybrid_continuous -1032.386
Finally, we compare the hybrid model’s continuous component to the best-performing standard model — the centered hierarchical model with a dense mass matrix.
combined_hybrid_diag <- bind_rows(dense_diag, hybrid_diag)
print(combined_hybrid_diag)
Parameter Rhat ESS Model LOO_ELPD
beta...1 beta 1.0003583 6580.948 dense_mass_matrix -1037.346
mu_alpha...2 mu_alpha 0.9997937 6229.065 dense_mass_matrix -1037.346
tau...3 tau 1.0001146 2147.312 dense_mass_matrix -1037.346
sigma...4 sigma 0.9996516 7291.118 dense_mass_matrix -1037.346
beta...5 beta 0.9996800 4366.170 hybrid_continuous -1032.386
mu_alpha...6 mu_alpha 0.9994333 3444.996 hybrid_continuous -1032.386
tau...7 tau 1.0038672 1350.484 hybrid_continuous -1032.386
sigma...8 sigma 0.9996509 4997.410 hybrid_continuous -1032.386
From this final dataframe, we can argue that the hybrid model is the best-performing model overall. It shows a modest but consistent improvement in predictive performance (LOO-ELPD = −1032.4 vs. −1037.3) while maintaining excellent convergence diagnostics. It is slightly less efficient in terms of sampling (lower ESS), but we do not prioritize sampling efficiency over predictive power here. Furthermore, it offers a richer model structure by including discrete latent variables, potentially capturing residual variation that the baseline dense model does not account for. We explore this difference in the final Discussion section.
To illustrate the difference between the two final models, we plot the posteriors of the corresponding parameters. A summary of the point estimates is also shown below.
draws_dense <- as_draws_df(fit_dense)
draws_hybrid <- as_draws_df(final_fit) # final_fit = last Stan fit from hybrid sampler
# Labels:
draws_dense$model <- "Dense"
draws_hybrid$model <- "Hybrid"
draws_combined <- rbind(
draws_dense[, c("mu_alpha", "beta", "sigma", "tau", "model")],
draws_hybrid[, c("mu_alpha", "beta", "sigma", "tau", "model")]
)
Warning: Dropping 'draws_df' class as required metadata was removed.
Warning: Dropping 'draws_df' class as required metadata was removed.
draws_long <- pivot_longer(draws_combined, cols = -model, names_to = "parameter", values_to = "value")
ggplot(draws_long, aes(x = value, fill = model)) +
geom_density(alpha = 0.5) +
facet_wrap(~ parameter, scales = "free", ncol = 2) +
labs(title = "Posterior Distributions: Hybrid vs Dense Model",
x = "Parameter Value", y = "Density") +
theme_minimal()
\(\beta\): The two posteriors are almost identical. Both models indicate that the floor level has a strong negative effect on radon.
\(\sigma\): Again, very similar results, with perhaps slightly higher precision for the hybrid model.
\(\mu_\alpha\): Interestingly, this is the only parameter where we see a definite difference. The hybrid model yields a lower estimate than the dense model. The latent variable \(Z\) appears to shift the intercept downwards.
\(\tau\): Very similar again.
# Point estimates:
print(final_fit, pars = c("mu_alpha", "beta", "tau", "sigma"), probs = c(0.025, 0.5, 0.975))
Inference for Stan model: anon_model.
4 chains, each with iter=1500; warmup=500; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
mu_alpha 1.30 0 0.05 1.20 1.29 1.40 3445 1
beta -0.67 0 0.07 -0.80 -0.67 -0.54 4366 1
tau 0.32 0 0.04 0.24 0.31 0.41 1350 1
sigma 0.72 0 0.02 0.69 0.72 0.76 4997 1
Samples were drawn using NUTS(diag_e) at Tue Mar 25 13:26:50 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Based on the final hybrid model, we estimate that the average log-radon level (intercept) across counties is approximately 1.30. The negative coefficient on floor suggests that radon levels tend to be significantly lower on upper floors than in basements. The between-county variability (\(\tau\)) and within-county residual variability (\(\sigma\)) are both clearly positive, indicating that counties truly differ from each other and that houses within a county also vary. This highlights the need for a hierarchical model such as ours.
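Since the outcome is log radon (natural log, as in the Gelman & Hill radon example), the floor coefficient has a multiplicative interpretation on the original scale. A one-line back-transform of the point estimate above makes this concrete (a rough sketch using the posterior mean only, ignoring posterior uncertainty):

```r
beta_hat <- -0.67   # posterior mean of beta from the fit above
exp(beta_hat)       # approx. 0.51: upper-floor radon is about half the basement level
```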
To wrap up, we can say that we successfully improved the model, even if only modestly. To deepen this, further research could explore dynamic hierarchical models where the latent structure evolves over time, or incorporate spatial dependencies between counties to better capture geographical patterns in radon exposure.
Our comparative analysis of the dense and hybrid hierarchical Bayesian models for radon exposure has demonstrated that both approaches yield similar estimates, particularly for the floor effect (β), within-county variability (σ), and between-county variability (τ). However, the hybrid model provides a slightly more precise estimation, particularly affecting the intercept (μ_α), which is lower compared to the dense model. This suggests that the latent variable structure in the hybrid model shifts the baseline radon estimates. While the improvements achieved by the hybrid model are not drastic, they suggest promising directions for future research.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–533.
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Papaspiliopoulos, O., Roberts, G. O., & Sköld, M. (2003). Non-centered parameterizations for hierarchical models and data augmentation. Bayesian Statistics, 7, 307–326.
Robert, C. P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.